October 7, 2015
rnorm(100)

No seed
rnorm(5); rnorm(5)
## [1] 0.1552290 0.7028124 1.0625147 -1.0643773 0.1942986
## [1] 0.3340692 1.6133966 -0.9996593 1.5276916 0.1277846
Seed
set.seed(1); rnorm(5)
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
set.seed(1); rnorm(5)
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
- rbinom, rnorm, rpois, …
- mvrnorm from the MASS package
- sample
- sample_n from the dplyr package
- sample_frac from the dplyr package
- http://simplystatistics.org/2013/03/06/the-importance-of-simulating-the-extremes/
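The helpers above can be sketched in a few lines; the specific values, sizes, and parameters here are illustrative, not from the notes:

```r
set.seed(2015)

# Draws from parametric families
x <- rnorm(5, mean = 0, sd = 1)        # normal draws
k <- rbinom(5, size = 10, prob = 0.3)  # binomial draws

# Resampling observed values without a parametric model
v <- c(94, 197, 16, 38, 99, 141, 23)
sample(v, size = 3)                          # simple random sample, no replacement
sample(v, size = length(v), replace = TRUE)  # a bootstrap-style resample
```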
rnbinom

http://bioinformatics.oxfordjournals.org/content/early/2015/04/28/bioinformatics.btv272.abstract
http://bioinformatics.oxfordjournals.org/content/early/2015/02/26/bioinformatics.btv124
http://biostatistics.oxfordjournals.org/content/early/2014/01/06/biostatistics.kxt053.full
P-values, although not essential for doing tests, are nearly ubiquitous in applied work. They are a very useful "shorthand" and you should understand them.
Under the convention that larger test statistics occur farther from \(H_0\), for observed data \(Y\) we define: \[p = p(Y) = Pr_{Y'\sim F|H_0}[T(Y') > T(Y)]\] i.e. the long-run proportion of datasets, under the null, which are "more extreme" than the observed \(Y\).
What's the distribution of a \(p\)-value? Under the null:
\[Pr_{Y \sim F|H_0}[p(Y) \leq \alpha] = Pr_{Y \sim F|H_0}\left[Pr_{Y' \sim F|H_0}[T(Y') > T(Y)] \leq \alpha \right]\] \[= Pr_{Y \sim F|H_0}[ 1 - \mathcal{F}(T(Y)) \leq \alpha]\] \[= Pr_{Y \sim F|H_0}[T(Y) \geq \mathcal{F}^{-1}(1-\alpha)]\] \[= 1 - \mathcal{F}(\mathcal{F}^{-1}(1-\alpha)) = \alpha\]
where \(\mathcal{F}(\cdot)\) denotes the cumulative distribution function of \(T(Y)\) under the null.
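This uniformity of null p-values is easy to check by simulation. A minimal sketch using a one-sample t-test under its own null; the choice of test and the number of replications are illustrative:

```r
# Under H0 (true mean 0, normal data), t-test p-values are Uniform(0,1),
# so about 5% of them should fall below 0.05
set.seed(1)
pvals <- replicate(10000, t.test(rnorm(20))$p.value)
mean(pvals <= 0.05)  # should be close to 0.05
hist(pvals)          # should look approximately flat
```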
There is tremendous confusion over \(p\)-values. The following hold very generally:
Suppose we observe survival times on 16 mice, 7 in a treatment group and 9 in a control group: \[X = (94, 197, 16, 38, 99, 141, 23)\] \[Y = (52, 104, 146, 10, 51, 30, 40, 27, 46)\] Suppose the values \(X \sim F\) and \(Y \sim G\), and we want to test \(H_0: F = G\). The difference of means is \(\hat{\theta} = \bar{x} - \bar{y} = 30.63\). One way to calculate a p-value would be to make a parametric assumption.
Another clever way, devised by Fisher, is permutation. Define the vector \(g\) of group labels: \(g_i = 1\) if treatment (\(X\)) and \(g_i = 0\) if control (\(Y\)). Then pool all of the observations into a single vector \(v = (94, 197, 16, 38, 99, 141, 23, 52, 104, 146, 10, 51, 30, 40, 27, 46)\).
If there are \(n\) samples from \(F\) and \(m\) samples from \(G\), for a total of \(N = m + n\) samples, then there are \({N \choose n}\) possible ways to assign the group labels.
Under \(F = G\) it is possible to show that - conditional on the observed values - each of these is equally likely. So we can write our statistic \[\hat{\theta} = \bar{x} - \bar{y} = \frac{1}{n} \sum_{g_i = 1} v_i - \frac{1}{m}\sum_{g_i=0} v_i\]
The permutation null distribution of \(\hat{\theta}\) is calculated by forming all permutations of \(g\) and recalculating the statistic.
Using \(B\) permutations, with \(\hat{\theta}^{b}\) the statistic recomputed on the \(b\)-th permutation, the estimated permutation p-value is \[\hat{p}_{perm} = \left(\#\{|\hat{\theta}^{b}| \geq |\hat{\theta}| \} + 1\right) /(B+1)\]
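The whole procedure can be sketched in a few lines of R on the mouse data above, using random permutations of the labels rather than enumerating all possible assignments (the choice of 10,000 permutations is illustrative):

```r
# Permutation test for the difference in means on the mouse data
v <- c(94, 197, 16, 38, 99, 141, 23,          # treatment (X)
       52, 104, 146, 10, 51, 30, 40, 27, 46)  # control (Y)
g <- rep(c(1, 0), times = c(7, 9))            # group labels
theta_hat <- mean(v[g == 1]) - mean(v[g == 0])  # observed statistic, 30.63

set.seed(1)
B <- 10000
theta_b <- replicate(B, {
  gp <- sample(g)                    # permute the group labels
  mean(v[gp == 1]) - mean(v[gp == 0])
})
p_perm <- (sum(abs(theta_b) >= abs(theta_hat)) + 1) / (B + 1)
```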
Bootstrapping is a computational procedure for approximating the sampling distribution of a statistic using only the observed data.
The basic idea behind bootstrapping is to treat the sample of data you observe (or some appropriate function of the data) as if it were the super-population you were sampling from. Then you sample from the observed set of values to approximate the sampling variation in the whole population.
The idea of the bootstrap was originally proposed by Efron in 1979.
Related ideas are much older by the standards of statistics (Quenouille, 1956, Notes on bias in estimation; Tukey, 1958, Bias and confidence in not-quite large samples).


The plug-in principle states that if we have a parameter \(\theta = t(F)\), then we estimate it by applying the same functional to an estimate of the distribution function: \(\hat{\theta} = t(\hat{F}_n)\). Although other estimates can also be used, the default \(\hat{F}_n = {\mathbb F}_n\) is the empirical distribution \[ {\mathbb F}_n(y) = \frac{1}{n} \sum_{i=1}^n 1(Y_i \leq y).\] A sample \(Y_i^+\) from \({\mathbb F}_n\) has the property that \(Y_i^+ = Y_j\) with probability \(1/n\) for \(1 \leq j \leq n\).
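In R, drawing from \({\mathbb F}_n\) is just sampling the observed values with replacement, since each observed value carries mass \(1/n\); a minimal sketch with illustrative data:

```r
# A draw Y+ from the empirical distribution F_n equals each observed
# value with probability 1/n, i.e. sampling with replacement
y <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
set.seed(1)
y_plus <- sample(y, size = length(y), replace = TRUE)
y_plus   # every entry is one of the original observations
```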
Suppose \(Y_i > 0\) are i.i.d. from \(F\). We might be interested in \[ \theta = {\rm E}_F \log(Y)\] - the log of the geometric mean of \(F\). A plug-in estimate of \(\theta\) is \[\hat{\theta} = {\rm E}_{{\mathbb F}_n} \log(Y^+)\] which is available in closed form: \[\hat{\theta} = \frac{1}{n} \sum_{i=1}^n \log(Y_i)\]
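The closed form is a one-liner in R; the data below are illustrative:

```r
# Plug-in estimate of theta = E_F log(Y): apply the same functional to F_n
y <- c(94, 197, 16, 38, 99, 141, 23)  # positive values
theta_hat <- mean(log(y))             # (1/n) * sum(log(y_i))
exp(theta_hat)                        # the geometric mean itself
```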
Let \(Y\) be a binary variable and suppose \[\theta(F) = Pr(Y = 1) = {\rm E}_F[1(Y_i = 1)]\] We can find the estimate of \(\theta(F)\) using the plug-in principle: \[\hat{\theta} = {\rm E}_{{\mathbb F}_n}[1(Y_i^+ = 1)] = \bar{Y}\] Suppose we wanted an estimate of the variance \[{\rm Var}(\hat{\theta}) = {\rm Var}(\bar{Y}) = {\rm Var}(Y_i)/n\]
We could use the plug-in estimator \[{\rm Var}_{{\mathbb F}_n}(Y_i^+)/n = {\rm E}_{{\mathbb F}_n}[(Y_i^+ - \bar{Y})^2]/n = \frac{\bar{Y}(1-\bar{Y})}{n}\]
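A quick numeric check that the plug-in formula matches its definition; the simulated data are illustrative:

```r
# Plug-in variance of Y-bar for binary data: Ybar * (1 - Ybar) / n.
# Note it divides by n, not n - 1 as var() does.
set.seed(1)
y <- rbinom(50, size = 1, prob = 0.3)
n <- length(y)
ybar <- mean(y)
var_plugin <- ybar * (1 - ybar) / n

# Same thing computed directly from the definition E_{F_n}[(Y+ - Ybar)^2]/n
var_direct <- mean((y - ybar)^2) / n
```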
When evaluating \({\rm E}_{{\mathbb F}_n}\), \(\bar{Y}\) is a "parameter" and treated as fixed.
Usually, no closed form evaluation will exist (we sort of "got lucky" in the previous examples). How did we get lucky? The plug-in estimate ended up being an expectation of a single random variable \(Y_i^+\).
What if we were unlucky and the plug in estimate was a function of all of the values \(Y_1^{+},\ldots,Y_n^{+}\)?
For example, suppose we want to estimate the variance of the sample median \(\hat{\theta} = {\mathbb F}_n^{-1}(1/2)\).
In this case, the variance \({\rm Var}_{{\mathbb F}_n}(\hat{\theta})\) is an expectation of a function of \(Y_1^+,\ldots,Y_n^+\).
It isn't clear there is a pretty formula for the variance of the median, but if we let \(X_j\) denote the number of times \(Y_j\) occurs in a bootstrap sample, then \((X_1,\ldots,X_n) \sim Mult(n; 1/n,\ldots, 1/n)\), so: \[{\rm Var}_{{\mathbb F}_n}(\hat{\theta}) = \sum_{Y^+ \in \mathcal{S}} \left\{{\mathbb F}_n^{+-1}(1/2) - {\rm E}_{{\mathbb F}_n}[{\mathbb F}_n^{+-1}(1/2)]\right\}^2 \frac{n!}{\prod_{i=1}^n x_i!} (1/n)^n\] where \({\mathbb F}_n^{+-1}(1/2)\) is the sample median of a bootstrap sample, \(x_i\) are the counts in that sample, and \(\mathcal{S}\) is the set of all unique bootstrap samples from \(Y_1,\ldots,Y_n\).
Most of the time, you'll use the bootstrap for parameters where a closed form doesn't necessarily exist. Instead we use a Monte Carlo approach. For the variance of the sample median you would: (1) draw a bootstrap sample \(Y_1^+,\ldots,Y_n^+\) by sampling the observed values with replacement; (2) compute the median of the bootstrap sample; (3) repeat \(B\) times; (4) take the empirical variance of the \(B\) bootstrap medians.
As \(B \rightarrow \infty\) you get closer and closer to the exact or "ideal bootstrap".
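The Monte Carlo recipe can be sketched as follows, on the pooled mouse data from earlier (the choice of \(B = 10000\) is illustrative):

```r
# Monte Carlo bootstrap for the variance of the sample median:
# resample with replacement B times, recompute the median each time,
# and take the empirical variance of the bootstrap medians
y <- c(94, 197, 16, 38, 99, 141, 23, 52, 104, 146, 10, 51, 30, 40, 27, 46)
set.seed(1)
B <- 10000
med_b <- replicate(B, median(sample(y, replace = TRUE)))
var_boot <- var(med_b)  # approximates the ideal bootstrap variance
se_boot <- sd(med_b)    # bootstrap standard error of the median
```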
The general form for calculating bootstrap standard errors is similar: